TF-IDF Embedding
- Compute the TF-IDF embeddings for a given sentence.
- TF-IDF Embedding: The TfidfVectorizer computes a TF-IDF vector for the sentence, resulting in an array of weights for each term.
- Store the embeddings in a vector database using FAISS.
- FAISS Storage: FAISS stores these embeddings in an index, allowing efficient similarity searches.
- Display the embedded data in a simple plot to show the embeddings and their index positions.
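- Retrieval and Plotting: The vector is visualized as a bar chart of each TF-IDF weight against its feature index. The tfidf_faiss_visualization function then searches the index for the nearest neighbor of the stored vector and displays the index, term, weight, and distance in a table; it does not return a value.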
In [ ]:
%pip install -q faiss-cpu langchain matplotlib
In [ ]:
import faiss
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
def tfidf_faiss_visualization(sentence):
    # Step 1: Generate TF-IDF embeddings
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([sentence])
    tfidf_array = tfidf_matrix.toarray()
    feature_names = vectorizer.get_feature_names_out()  # Terms corresponding to each feature

    # Step 2: Initialize FAISS index
    dimension = tfidf_array.shape[1]
    index = faiss.IndexFlatL2(dimension)  # L2 distance; alternatively faiss.IndexFlatIP(dimension)
    # FAISS requires contiguous float32 input; convert once and reuse the same array
    vectors = np.ascontiguousarray(tfidf_array, dtype='float32')
    faiss.normalize_L2(vectors)  # Normalizes in place (TfidfVectorizer already L2-normalizes by default)

    # Step 3: Add embeddings to FAISS
    index.add(vectors)

    # Step 4: Retrieve the nearest neighbor for each vector (self-similarity)
    distances, indices = index.search(vectors, k=1)  # Self-search returns its own index and distance

    # Step 5: Visualize the TF-IDF weight of each feature
    plt.figure(figsize=(9, 5))
    plt.bar(range(dimension), tfidf_array[0], color='skyblue')
    plt.xlabel("TF-IDF Feature Index")
    plt.ylabel("TF-IDF Weight")
    plt.title("TF-IDF Embedding Weights for Each Feature in the Sentence")
    plt.show()

    # Step 6: Display embedded data in a table with indices, distances, and terms
    table_data = {
        "Index": list(range(dimension)),
        "TF-IDF Weight": tfidf_array[0],
        # The index holds one vector, so the same self-distance applies to every row
        "Nearest Neighbor Distance (IndexFlatL2)": [distances[0][0]] * dimension,
        "Term": feature_names
    }
    table_df = pd.DataFrame(table_data)

    # Print the table
    print("\nTF-IDF Embedding Table with Index, Term, TF-IDF Weight, and Nearest Neighbor Distance\n")
    display(table_df)

# Test the function with an example sentence
tfidf_faiss_visualization("""The financial services sector has experienced robust growth due to the adoption of digital banking and financial technology.
Our firm's investment banking division saw a revenue increase of 20% year-over-year, driven by higher client acquisition and new advisory services.
For the fiscal year ending in 2023, the company reported revenue of $30 million with an EBITDA margin of 30%.
The debt-to-equity ratio remains low at 0.4, providing a stable foundation for future investments and expansion in asset management.
Additionally, the company's market share in wealth management grew by 7%, attributed to expanded service offerings and improved client retention.""")
TF-IDF Embedding Table with Index, Term, TF-IDF Weight, and Nearest Neighbor Distance
Index | TF-IDF Weight | Nearest Neighbor Distance (IndexFlatL2) | Term |
---|---|---|---|
0 | 0.07254762501100116 | 0.0 | 20 |
1 | 0.07254762501100116 | 0.0 | 2023 |
2 | 0.14509525002200233 | 0.0 | 30 |
3 | 0.07254762501100116 | 0.0 | acquisition |
4 | 0.07254762501100116 | 0.0 | additionally |
5 | 0.07254762501100116 | 0.0 | adoption |
6 | 0.07254762501100116 | 0.0 | advisory |
7 | 0.07254762501100116 | 0.0 | an |
8 | 0.29019050004400465 | 0.0 | and |
9 | 0.07254762501100116 | 0.0 | asset |
10 | 0.07254762501100116 | 0.0 | at |
11 | 0.07254762501100116 | 0.0 | attributed |
12 | 0.14509525002200233 | 0.0 | banking |
13 | 0.14509525002200233 | 0.0 | by |
14 | 0.14509525002200233 | 0.0 | client |
15 | 0.14509525002200233 | 0.0 | company |
16 | 0.07254762501100116 | 0.0 | debt |
17 | 0.07254762501100116 | 0.0 | digital |
18 | 0.07254762501100116 | 0.0 | division |
19 | 0.07254762501100116 | 0.0 | driven |
20 | 0.07254762501100116 | 0.0 | due |
21 | 0.07254762501100116 | 0.0 | ebitda |
22 | 0.07254762501100116 | 0.0 | ending |
23 | 0.07254762501100116 | 0.0 | equity |
24 | 0.07254762501100116 | 0.0 | expanded |
25 | 0.07254762501100116 | 0.0 | expansion |
26 | 0.07254762501100116 | 0.0 | experienced |
27 | 0.14509525002200233 | 0.0 | financial |
28 | 0.07254762501100116 | 0.0 | firm |
29 | 0.07254762501100116 | 0.0 | fiscal |
30 | 0.14509525002200233 | 0.0 | for |
31 | 0.07254762501100116 | 0.0 | foundation |
32 | 0.07254762501100116 | 0.0 | future |
33 | 0.07254762501100116 | 0.0 | grew |
34 | 0.07254762501100116 | 0.0 | growth |
35 | 0.07254762501100116 | 0.0 | has |
36 | 0.07254762501100116 | 0.0 | higher |
37 | 0.07254762501100116 | 0.0 | improved |
38 | 0.2176428750330035 | 0.0 | in |
39 | 0.07254762501100116 | 0.0 | increase |
40 | 0.07254762501100116 | 0.0 | investment |
41 | 0.07254762501100116 | 0.0 | investments |
42 | 0.07254762501100116 | 0.0 | low |
43 | 0.14509525002200233 | 0.0 | management |
44 | 0.07254762501100116 | 0.0 | margin |
45 | 0.07254762501100116 | 0.0 | market |
46 | 0.07254762501100116 | 0.0 | million |
47 | 0.07254762501100116 | 0.0 | new |
48 | 0.29019050004400465 | 0.0 | of |
49 | 0.07254762501100116 | 0.0 | offerings |
50 | 0.07254762501100116 | 0.0 | our |
51 | 0.07254762501100116 | 0.0 | over |
52 | 0.07254762501100116 | 0.0 | providing |
53 | 0.07254762501100116 | 0.0 | ratio |
54 | 0.07254762501100116 | 0.0 | remains |
55 | 0.07254762501100116 | 0.0 | reported |
56 | 0.07254762501100116 | 0.0 | retention |
57 | 0.14509525002200233 | 0.0 | revenue |
58 | 0.07254762501100116 | 0.0 | robust |
59 | 0.07254762501100116 | 0.0 | saw |
60 | 0.07254762501100116 | 0.0 | sector |
61 | 0.07254762501100116 | 0.0 | service |
62 | 0.14509525002200233 | 0.0 | services |
63 | 0.07254762501100116 | 0.0 | share |
64 | 0.07254762501100116 | 0.0 | stable |
65 | 0.07254762501100116 | 0.0 | technology |
66 | 0.435285750066007 | 0.0 | the |
67 | 0.2176428750330035 | 0.0 | to |
68 | 0.07254762501100116 | 0.0 | wealth |
69 | 0.07254762501100116 | 0.0 | with |
70 | 0.2176428750330035 | 0.0 | year |